Project Presentation

Group #6

Published

December 3, 2024

Team members

  • Zhuoyan Guo
  • Xue Qin
  • Congfei Yin

Predicting Aids Patient Survival with Different Treatment

Introduction

Background

HIV, or human immunodeficiency virus, is a virus that attacks the immune system by targeting white blood cells, making the body more vulnerable to illnesses like tuberculosis, infections, and certain cancers. Without treatment, HIV can progress over time to its most severe stage, AIDS (acquired immunodeficiency syndrome). The virus is transmitted through bodily fluids such as blood, breast milk, semen, and vaginal fluids, and can also be passed from mother to child.1

The symptom and signs are as follows:

  • fever

  • headache

  • rash

  • sore throat

HIV-Symptom

HIV-Symptom

Data sources

  • Data URL

  • It’s a public dataset contains records of patients who were diagnosis with AIDS.

  • The data contained 2139 intances, include integer and catugoriacal variables

Project Goal

The goal of this project is to predict Aids patient treatment failure (death) or survival (censoring) with a certain window of time by comparing two different treatments and looking at the clinical factors such as patient demographics, treatment indicators, and clinical factors such as CD4 counts and Karnofsky scores.

Methods

  • Survival Analysis

  • Logistic Regression

  • Logic Regression

  • Decision Tree

Methodology that will be included in this project would be survival analysis and machine learning methods. Kaplan-Meier Estimator and survival curve will help to compare the difference of using two treatments by visualizing probability of survival over time. The prediction models such as Linear Regression, Logistic Regression and Decision Tree will be used to predict the patients’ survival outcome. The evaluation of models will also proceed during the model development.

Questions to address

  1. How do patient demographics (e.g., age, gender, baseline health status) influence the effectiveness of different treatment methods on survival outcomes?
  2. What clinical features (e.g., CD4 counts, Karnofsky score) significantly impact survival outcomes for patients undergoing different treatments?
  • Which combination of CD4 counts and Karnofsky scores provides the most accurate prediction of patient outcomes for each treatment method?
  1. Which treatment method would be better?
  • How do patient demographics (e.g., age, gender, baseline health status) influence the effectiveness of different treatment methods on survival outcomes?
  1. What are the most important indicators/features that better give the diagnois result?
  • What are the most significant clinical factors influencing patient survival when considering CD cell counts and treatment methods?
  1. what is the best model to predict the diagnosis result?

Exploratory Analysis

Statistic Summary

  • Average survival time is around 879.1 days

  • The age of patients range from 12 to 70 years old and the average age is 35.25 years old.

  • The average weight is range from 31 kg to 159.94 kg and the average weight is 75.13kg

Feature Correlation Heatmap

  • The warmer color shows the positive correlated relationship
  • variable cd820 (CD8 at 20+/-5 weeks) and cd80 (CD8 at baseline) are highly correlated
  • cd420(CD4 at 20+/-5 weeks) and cd40(CD4 at baseline)

Distribution of Counts of Treatment Group

  • Each treatment group are almost equally distributed
  • Would not cause any bias about the final treatment results

Gender proportion by treatment group

  • The number of male patients are much larger than female patients
  • Every treatment is balanced distributed

Relationship between CD4 and CD8 counts

  • No big difference between CD4 and CD8 in each treatment group
  • No clear trend that shows the relationship between CD4 and CD8

Pie Chart for Diagnoisis Outcome distribution

Correlation Heatmap of Karnofsky score

  • The positive correlation between the Karnofsky score and CD4 counts suggests that patients with better functional abilities

Survival Analysis

By applying Survival Analysis, we want to answer the following questions:

  1. What are the survival probabilities associated with different treatment methods?

  2. How do clinical indicators or demographic factors affect survival outcomes?

We focus on estimating survival probabilities and comparing different groups using Kaplan-Meier curves and the Cox Proportional Hazards model.

Kaplan-Meier

  1. Analyze Survival Probabilities Across Treatments

Kaplan-Meier Curve for All Treatments
  • trt=0 showing a significant worse outcomes compared to the other three treatments indicating by the p-value < 0.0001.

  • Patients who toke trt=1, trt=2, and trt=3 shows similar survival curves.

  1. Focus on Effective Treatments: Survival Differences in trt=1,2,3

Kaplan-Meier Curve for Treatments 1, 2, and 3
  • The survival probabilities across these treatments are nearly indistinguishable according to the p-value = 0.4.
  • This similarity implies that external factors – other than treatments themselves –may play a role in shaping the survival outcomes.
  1. Explore Factors That Affect Survival Probabilities Using the Bootstrap Method

Gender:

  • For trt=0, gender significantly impacts survival probabilities (p-value = 0.024), with female showing higher survival probabilities.
  • Gender differences are less significant for trt=1, trt=2, and trt=3 (p-values = 0.35, 0.061, and 0.79).

Survival Curves Stratified by Gender

ARV History:

  • ARV history means the whether the patient has ever taken the antiretroviral treatments, which is the treatment of HIV/AIDS involves treatment of opportunistic infections, prophylaxis, and antiretroviral (ARV) therapy.
  • Using trt=1, antiretroviral history (naive vs. experienced) does not significantly impact survival outcomes for patients.
  • Using trt=0, trt=2, and trt=3, ARV history has a substantial impact on survival probabilities. Patients with prior ARV treatment show significantly lower survival probabilities compared to naive patients due to the impact of drug resistance.

Survival Curves Stratified by ARV History

Symptom

  • For trt=0 and trt=3, the p-values (0.11 and 0.38) indicate no statistically significant difference in survival probabilities between symptomatic and asymptomatic patients.
  • For trt=1, the p-value (0.0039) shows a significant difference, where asymptomatic patients exhibit slightly better survival probabilities compared to symptomatic patients.

Survival Curves Stratified by Symptom

Karnofsky Score

  • Karnofsky score measures a patient’s functional status on a scale from 0 to 100, with higher scores indicating better physical capability and ability to carry out daily activities.
  • There is no significant difference in survival probabilities for patients using treatment 0, 1, and 3.
  • Karnofsky score quartiles significantly impact survival probabilities (p = 0.038), with higher quartiles (Q3 and Q4) associated with improved survival outcomes if patients receiving trt=2.

Survival Curves Stratified by Karnofsky Quantile

CD4 at Baseline

  • The CD4 cell count is a critical measure of immune system health, which reflects the number of CD4 T-helper cells in the blood, which are essential for immune function.
  • There is a consistent and significant impact of CD4 quartiles on survival probabilities across all treatment groups: survival probabilities significantly improve with higher CD4 quantiles, saying that immune status is a key determinant of survival outcomes for all groups and patients with higher CD4 consistently demonstrate better survival probabilities.

Survival Curves Stratified by CD4 Quantile

Cox Proportional Hazard Models

  • The Kaplan-Meier (KM) curves provide a fundational view of survival probabilities by visualizing unadjusted survival outcomes across different strata.
  • The Cox proportional hazards model is a semi-parametric approach that allows for the simultaneous inclusion of multiple covariates. And it can adjust for confounding factors, ensuring a clearer understanding of each variable’s independent effect on survival.
  1. Cox Model with Interaction Terms

Adjusted Survival Curves by Treatment

Logistic Regression

Model Analysis

  • Confusion matrix
Prediction vs Actual 0 1
0 79 35
1 26 115
  • Model Results

Lasso Regression

Metric Value
Accuracy 77.25%
Sensitivity 0.7810
Specificity 0.7667
p-value 4.321e-10

Feature Impoartance

offtrt, cd420, race, hemo, symptom, drugs, gender, cd820, age, strat.

Decision Tree

Model Analysis

  • Confusion matrix

  • Model Results
Metric Value
Accuracy 72.94%
Sensitivity 0.7714
Specificity 0.7000
p-value 1.863e-06

Feature Impoartance

cd420, offtrt, age, preanti, cd40, gender

Random Forest

Model Analysis

  • Confusion matrix
0 1
0 82 36
1 23 114
  • Model Results
Metric Value
Accuracy 76.86%
Sensitivity 0.7810
Specificity 0.7600
p-value 0.1182

Feature Impoartance

offtrt, cd420, preanti, age, homo, z30, strat, gender, cd40, cd80

Conclusion

Answer to Questions

  1. How do patient demographics (e.g., age, gender, baseline health status) influence the effectiveness of different treatment methods on survival outcomes?
  • Gender
  • Antiretroviral History
  1. What clinical features (e.g., CD4 counts, Karnofsky score) significantly impact survival outcomes for patients undergoing different treatments?
  • Symptom
  • Karnofsky Score
  • CD4 counts at the baseline
  1. Which treatment method would be better?
  • trt=0 is the worst while other three treatments don’t have significant difference.

Tabel: Model Comparasion

Models Accuracy Results Important Features
Logistic Regression 0.7608 “offtrt” “cd420” “race” “hemo” “symptom” “drugs” “gender” “cd820” “age” “strat”
Decision Tree 0.7294 “cd420” “offtrt” “age” “preanti” “cd40” “gender”
Random Forest 0.7686 “offtrt” “cd420” “preanti” “age” “homo” “z30” “strat” “gender” “cd40” “cd80”
  1. What are the most important indicators/features that better give the diagnois result?
  • gender
  • age
  • offtrt: indicator of off-trt before 96+/-5 weeks (0=no,1=yes)
  • cd420: CD4 at 20+/-5 weeks
  1. what is the best model to predict the diagnosis result?
  • Random Forest Model has the highest prediction accuracy in predicting the diagnosis result.